NSF PAR Search | NSF Public Access Repository

Active Learning Design Choices for NER with Transformers

Vacareanu, Robert; Noriega-Atala, Enrique; Hahn-Powell, Gus; Valenzuela-Escarcega, Marco A; Surdeanu, Mihai (May 2024, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024))
Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen (Ed.)
We explore multiple important choices that have not been analyzed in conjunction regarding active learning for token classification using transformer networks. These choices are: (i) how to select what to annotate, (ii) whether to annotate entire sentences or smaller sentence fragments, (iii) how to train with incomplete annotations at the token level, and (iv) how to select the initial seed dataset. We explore whether annotating at the sub-sentence level can translate to improved downstream performance by considering two different sub-sentence annotation strategies: (i) entity-level, and (ii) token-level. These approaches result in some sentences being only partially annotated. To address this issue, we introduce and evaluate multiple strategies to deal with partially annotated sentences during the training process. We show that annotating at the sub-sentence level achieves comparable or better performance than sentence-level annotations with a smaller number of annotated tokens. We then explore the extent to which the performance gap remains once annotation time is accounted for, and find that both annotation schemes perform similarly.
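The abstract does not say which strategies the authors use for partially annotated sentences. As a rough illustration only, one common option is to leave unannotated tokens out of the training loss. The sketch below assumes that approach; the `IGNORE` marker, the function name, and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math

IGNORE = None  # hypothetical marker for a token the annotator did not label


def masked_cross_entropy(probs, labels):
    """Average negative log-likelihood over annotated tokens only.

    probs:  one dict per token mapping label -> predicted probability
    labels: one gold label per token, or IGNORE where no annotation exists
    """
    losses = [
        -math.log(p[y])
        for p, y in zip(probs, labels)
        if y is not IGNORE
    ]
    # With no annotated tokens there is nothing to learn from.
    return sum(losses) / len(losses) if losses else 0.0


# Toy sentence with three tokens; the third token was never annotated.
probs = [
    {"O": 0.9, "B-PER": 0.1},
    {"O": 0.2, "B-PER": 0.8},
    {"O": 0.5, "B-PER": 0.5},
]
labels = ["O", "B-PER", IGNORE]
loss = masked_cross_entropy(probs, labels)
```

Under this assumption, the unannotated third token contributes nothing to the loss, so the model is neither rewarded nor penalized for its prediction there.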
Proceedings of the 2nd Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning (PAN-DL 2023)

Surdeanu, Mihai; Riloff, Ellen; Chiticariu, Laura; Freitag, Dayne; Hahn-Powell, Gus; Morrison, Clayton T; Noriega-Atala, Enrique; Sharp, Rebecca; Valenzuela-Escárcega, Marco (December 2023, Proceedings of the 2nd Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning)

Message from the Organizers

Welcome to the second edition of the Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning (Pan-DL)! Our workshop is being organized in a hybrid format on December 6, 2023, in conjunction with the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). In the past year, the natural language processing (NLP) field (and the world at large!) has been hit by the large language model (LLM) "tsunami." This happened for the right reasons: LLMs perform extremely well in a multitude of NLP tasks, often with minimal training, and, perhaps for the first time, have made NLP technology extremely approachable to non-expert users. However, LLMs are not perfect: they are not really explainable; they are not pliable, i.e., they cannot be easily modified to correct any errors observed; and they are not efficient due to the overhead of decoding. In contrast, rule-based methods are more transparent to subject matter experts; they are amenable to having a human in the loop through intervention, manipulation and incorporation of domain knowledge; and further, the resulting systems tend to be lightweight and fast. This workshop focuses on all aspects of rule-based approaches, including their application, representation, and interpretability, as well as their strengths and weaknesses relative to state-of-the-art machine learning approaches. Considering the large number of potential directions in this neuro-symbolic space, we emphasized inclusivity in our workshop. We received 19 submissions and accepted 10 for oral presentation. This resulted in an overall acceptance rate of 52%. Our workshop also includes 6 presentations of papers that were accepted in Findings of EMNLP. In addition to the oral presentations of the accepted papers, our workshop includes a keynote talk by Yunyao Li, who has made many important contributions to the field of symbolic approaches for natural language processing.
Further, the workshop contains a panel that will discuss the merits and limitations of rules in the new LLM era. The panelists will be academics with expertise in both neural- and rule-based methods, industry experts that employ these methods for commercial products, and subject matter experts that have used rule-based methods for domain-specific applications. We thank Yunyao Li and the panelists for their important contribution to our workshop! Finally, we are thankful to the members of the program committee for their insightful reviews! We are confident that all submissions have benefited from their expert feedback. Their contribution was a key factor for accepting a diverse and high-quality list of papers, which we hope will make the second edition of the Pan-DL workshop a success, and will motivate many future editions.

Pan-DL 2023 Organizers
December 6, 2023
Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision

Noriega-Atala, Enrique; Vacareanu, Robert; Hahn-Powell, Gus; Valenzuela-Escárcega, Marco A. (October 2022, Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning)

We propose a neural-based approach for rule synthesis designed to help bridge the gap between the interpretability, precision and maintainability exhibited by rule-based information extraction systems and the scalability and convenience of statistical information extraction systems. This is achieved by not placing the burden of learning another specialized language on domain experts, and instead asking them to provide a small set of examples in the form of highlighted spans of text. We introduce a transformer-based architecture that drives a rule synthesis system. The system leverages a self-supervised approach for pre-training a large-scale language model, complemented by an analysis of different loss functions and aggregation mechanisms for variable-length sequences of user-annotated spans of text. The results are encouraging and point to different desirable properties, such as speed and quality, depending on the choice of loss and aggregation method.
Proceedings of Pattern-based Approaches to NLP in the Age of Deep Learning (PAN-DL)

Chiticariu, Laura; Goldberg, Yoav; Hahn-Powell, Gus; Morrison, Clayton T; Naik, Aakanksha; Sharp, Rebecca; Surdeanu, Mihai; Valenzuela-Escárcega, Marco; Noriega-Atala, Enrique (October 2022, Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning)

Message from the Organizers

Welcome to the first edition of the Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning (Pan-DL)! Our workshop is being organized online on October 17, 2022, in conjunction with the 29th International Conference on Computational Linguistics (COLING). We all know that deep-learning methods have dominated the field of natural language processing in the past decade. However, these approaches usually rely on the availability of high-quality and high-quantity data annotation. Furthermore, the learned models are difficult to interpret and incur substantial technical debt. As a result, these approaches tend to exclude users that lack the necessary machine learning background. In contrast, rule-based methods are easier to deploy and adapt; they support human examination of intermediate representations and reasoning steps; they are more transparent to subject-matter experts; they are amenable to having a human in the loop through intervention, manipulation and incorporation of domain knowledge; and further, the resulting systems tend to be lightweight and fast. This workshop focuses on all aspects of rule-based approaches, including their application, representation, and interpretability, as well as their strengths and weaknesses relative to state-of-the-art machine learning approaches. Considering the large number of potential directions in this neuro-symbolic space, we emphasized inclusivity in our workshop. We received 13 papers and accepted 10 for oral presentation. This resulted in an overall acceptance rate of 77%. In addition to the oral presentations of the accepted papers, our workshop includes a keynote talk by Ellen Riloff, who has made crucial contributions to the field of natural language processing, many of which are at the intersection of rule- and neural-based methods. Further, the workshop contains a panel that will discuss the merits and limitations of rules in our neural era.
The panelists will be academics with expertise in both neural- and rule-based methods, industry experts that employ these methods for commercial products, government officials in charge of AI funding, organizers of natural language processing evaluations, and subject matter experts that have used rule-based methods for domain-specific applications. We thank Ellen Riloff and the panelists for their important contribution to our workshop! Finally, we are thankful to the members of the program committee for their insightful reviews! We are confident that all submissions have benefited from their expert feedback. Their contribution was a key factor for accepting a diverse and high-quality list of papers, which we hope will make the first edition of the Pan-DL workshop a success, and will motivate many future editions.

Pan-DL 2022 Organizers
October 2022
A Human-machine Interface for Few-shot Rule Synthesis for Information Extraction

Vacareanu, Robert; Barbosa, George C.; Noriega-Atala, Enrique; Hahn-Powell, Gus; Sharp, Rebecca; Valenzuela-Escárcega, Marco A.; Surdeanu, Mihai (July 2022, NAACL)

We propose a system that assists a user in constructing transparent information extraction models, consisting of patterns (or rules) written in a declarative language, through program synthesis. Users of our system can specify their requirements through the use of examples, which are collected with a search interface. The rule-synthesis system proposes rule candidates and the results of applying them on a textual corpus; the user has the option to accept the candidate, request another option, or adjust the examples provided to the system. Through an interactive evaluation, we show that our approach generates high-precision rules even in a 1-shot setting. In a second evaluation on a widely used relation extraction dataset (TACRED), our method generates rules that considerably outperform manually written patterns. Our code, demo, and documentation are available at https://clulab.github.io/odinsynth/.
Understanding the Polarity of Events in the Biomedical Literature: Deep Learning vs. Linguistically-informed Methods

https://doi.org/10.18653/v1/W19-2603

Noriega-Atala, Enrique; Liang, Zhengzhong; Bachman, John; Morrison, Clayton; Surdeanu, Mihai (June 2019, Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications)

An important task in the machine reading of biochemical events expressed in biomedical texts is correctly reading the polarity, i.e., attributing whether the biochemical event is a promotion or an inhibition. Here we present a novel dataset for studying polarity attribution accuracy. We use this dataset to train and evaluate several deep learning models for polarity identification, and compare these to a linguistically informed model. The best-performing deep learning architecture achieves an average F1 of 0.968 in a five-fold cross-validation study, a considerable improvement over the linguistically informed model's average F1 of 0.862.
Learning what to read: Focused machine reading

https://doi.org/10.18653/v1/D17-1313

Noriega-Atala, Enrique; Valenzuela-Escárcega, Marco A.; Morrison, Clayton; Surdeanu, Mihai (September 2017, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing)

Recent efforts in bioinformatics have achieved tremendous progress in the machine reading of biomedical literature, and the assembly of the extracted biochemical interactions into large-scale models such as protein signaling pathways. However, batch machine reading of literature at today’s scale (PubMed alone indexes over 1 million papers per year) is infeasible due to both cost and processing overhead. In this work, we introduce a focused reading approach that guides the machine reading of biomedical literature toward the documents that should be read to answer a biomedical query as efficiently as possible. We introduce a family of algorithms for focused reading, including an intuitive, strong baseline, and a second approach which uses a reinforcement learning (RL) framework that learns when to explore (widen the search) or exploit (narrow it). We demonstrate that the RL approach is capable of answering more queries than the baseline, while being more efficient, i.e., reading fewer documents.
Large-scale automated machine reading discovers new cancer-driving mechanisms

https://doi.org/10.1093/database/bay098

Valenzuela-Escárcega, Marco A; Babur, Özgün; Hahn-Powell, Gus; Bell, Dane; Hicks, Thomas; Noriega-Atala, Enrique; Wang, Xia; Surdeanu, Mihai; Demir, Emek; Morrison, Clayton T (January 2018, Database)